\chapter{Fault Tolerance}
\label{chap:fault-tolerance}

HyperDex recovers from failures automatically.  When a node fails, HyperDex
detects the failure, removes the failed node, and automatically reintegrates
other nodes to restore the desired level of fault tolerance.  The coordinator is
made fault tolerant using a replicated state machine.  It, too, is fault
tolerant and can tolerate a configurable number of failures.

Let's see both kinds of fault tolerance in action.

\section{Coordinator}
\label{chap:fault-tolerance:coordinator}

First, let's demonstrate the coordinator's fault tolerance.  Bring up a
singly-fault tolerant coordinator using the same procedure as in previous
chapters.  Note that this single node coordinator can shutdown and restart, but
is unavailable while it is shutdown for obvious reasons.

\begin{consolecode}
hyperdex coordinator -f -l 127.0.0.1 -p 1982 -D /path/to/coord1
\end{consolecode}

A HyperDex coordinator becomes more fault tolerant as more nodes are added.  A
simple majority of the nodes must remain available for the coordinator to make
progress.  For example, a cluster with three nodes can fail one node at any
point in time, while a cluster with five nodes can tolerate two simultaneous
failures.

Internally, the HyperDex coordinator uses state machine replication to provide
fault tolerance.  Running at least five servers is recommended so that the
cluster can tolerate at least two failures.  To do so, bring additional
coordinators online:

\begin{consolecode}
hyperdex coordinator -f -l 127.0.0.1 -p 1983 -c 127.0.0.1 -P 1982 -D /path/to/coord2
hyperdex coordinator -f -l 127.0.0.1 -p 1984 -c 127.0.0.1 -P 1982 -D /path/to/coord3
hyperdex coordinator -f -l 127.0.0.1 -p 1985 -c 127.0.0.1 -P 1982 -D /path/to/coord4
hyperdex coordinator -f -l 127.0.0.1 -p 1986 -c 127.0.0.1 -P 1982 -D /path/to/coord5
\end{consolecode}

For each new server in the cluster, the coordinator will automatically transfer
a copy of the initialized state machine and replicate further operations on
those nodes as well.  In its current form our cluster can tolerate two
simultaneous failures or planned shutdowns.  Go ahead and shutdown one or more
coordinator servers by pressing Control-C.  When restarted, they'll rejoin the
cluster and collect any state changes that occurred in their absence.

It's even possible to completely shutdown and restart the coordinator by
shutting down and restarting all nodes in the cluster.

\section{HyperDex Daemons}
\label{chap:fault-tolerance:daemons}

HyperDex daemons are organized by the coordinator.  In response to failures and
shutdowns, the coordinator automatically responds by assigning a new server to
take its place and initiating recovery procedures to ensure the new server
maintains an up-to-date view of the data.

To see this in action, we first need to start a cluster:

\begin{consolecode}
hyperdex daemon -f --listen=127.0.0.1 --listen-port=2012 \
                   --coordinator=127.0.0.1 --coordinator-port=1982 --data=/path/to/data1
hyperdex daemon -f --listen=127.0.0.1 --listen-port=2013 \
                   --coordinator=127.0.0.1 --coordinator-port=1983 --data=/path/to/data2
hyperdex daemon -f --listen=127.0.0.1 --listen-port=2014 \
                   --coordinator=127.0.0.1 --coordinator-port=1984 --data=/path/to/data3
\end{consolecode}

If you look carefully, you'll see that we've asked each daemon to connect to a
different coordinator server.  Each daemon connects to the coordinator in a
fault-tolerant manner.  Should the specified server fail after the daemon
connects, the daemon will simply connect to another server in the cluster.  When
a failure of either a coordinator node or a daemon occurs, the coordinator
ensures that all servers in the system will eventually be able to make progress.

Let's see this in action by creating some data items we care deeply about and
checking what happens to our data as failures occur:

\begin{pythoncode}
>>> a.add_space('''
... space phonebook
... key username
... attributes first, last, int phone
... subspace first, last, phone
... create 8 partitions
... tolerate 2 failures
... ''')
>>> c.put('phonebook', 'jsmith1', {'phone': 6075551024})
True
>>> c.get('phonebook', 'jsmith1')
{'first': 'John', 'last': 'Smith', 'phone': 6075551024}
\end{pythoncode}

So now we have a data item we deeply care about. We certainly would not want our
NoSQL store to lose this data item because of a failure. Let's create a failure
by killing one of the three hyperdaemon processes we started in the setup phase
of the tutorial. Feel free to use \code{kill -9}, there is no requirement that
the nodes shut down in an orderly fashion.  HyperDex is designed to handle a
configurable number of crash failures.

\begin{pythoncode}
>>> # kill a node at random
>>> c.get('phonebook', 'jsmith1')
{'first': 'John', 'last': 'Smith', 'phone': 6075551024}
>>> c.put('phonebook', 'jsmith1', {'phone': 6075551025})
True
>>> c.get('phonebook', 'jsmith1')
{'first': 'John', 'last': 'Smith', 'phone': 6075551025}
>>> c.put('phonebook', 'jsmith1', {'phone': 6075551026})
True
\end{pythoncode}

So, our data is alive and well. Not only that, but the subspace is continuing to
operate as normal and handling updates at its usual rate.

Let's kill one more server.

\begin{pythoncode}
>>> # kill a node at random
>>> c.get('phonebook', 'jsmith1')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "hyperclient.pyx", line 473, in hyperclient.Client.put ...
File "hyperclient.pyx", line 499, in hyperclient.Client.async_put ...
File "hyperclient.pyx", line 255, in hyperclient.DeferredInsert.__cinit__ ...
hyperclient.HyperClientException: Connection Failure
>>> c.get('phonebook', 'jsmith1')
{'first': 'John', 'last': 'Smith', 'phone': 6075551026}
\end{pythoncode}

Note that the HyperDex API exposes some failures to the clients at the moment,
so a client may have to catch HyperClientException and retry the operation.  The
HyperDex library does not resubmit failed operations on behalf of clients.  In
this example, behind the scenes, there were two node failures in the
triply-replicated space. Each failure was detected, the space was repaired by
cleaving out the failed node, and normal operations resumed without data loss.

\section{Fault Tolerance Thresholds}
\label{chap:fault-tolerance:thresholds}

HyperDex daemons and coordinators each tolerate a configurable number of
failures before the system fails completely.  For a desired level of fault
tolerance $f$, you need to start $f + 1$ daemons and $2 f + 1$ coordinators.
Both are able to tolerate more than $f$ failures so long as enough nodes rejoin
the cluster to bring the number of failures back under the failure threshold.

\section{Shutting Down and Restoring a Cluster}
\label{chap:fault-tolerance:reboot}

On occasion, you might need to completely shutdown a HyperDex cluster.  For
example, planned power outages or kernel upgrades might require such shutdowns.
Note that complete shutdowns, by definition, require every server process to be
killed and therefore violate the fault tolerance threshold.  Managing and
recovering from such shutdowns is in general a non-trivial process, but HyperDex
makes it easy to restart all processes without any data loss.

Shutting down a cluster is a three step process.  First, stop all client
traffic.  Second, kill all the daemon processes using \code{SIGHUP},
\code{SIGINT}, or \code{SIGTERM}.  The daemons will inform the coordinator of
their departure, and the coordinator will track which daemons departed
successfully.  Finally, kill each coordinator process using \code{SIGHUP},
\code{SIGINT}, or \code{SIGTERM}.

To restore a cluster, all you need to do is first restart the coordinator nodes
and then restart the daemon nodes.  Because state was cleanly saved to the disk,
this restart process will bring back the entire state of the database without
data loss.
